Toward Pan-Slavic NLP: Some Experiments with Language Adaptation

Author

  • Serge Sharoff
Abstract

There is great variation in the amount of NLP resources available for Slavic languages. For example, the Universal Dependencies treebank (Nivre et al., 2016) has about 2 MW (million words) of training resources for Czech and more than 1 MW for Russian, but only 950 words for Ukrainian and nothing for Belorussian, Bosnian or Macedonian. Similarly, the Autodesk Machine Translation dataset covers only three Slavic languages (Czech, Polish and Russian). In this talk I present a general approach, which can be called Language Adaptation by analogy with Domain Adaptation. In this approach, a model for a particular language processing task is built by lexical transfer of cognate words and by learning a new feature representation for a lesser-resourced (recipient) language, starting from a better-resourced (donor) language. More specifically, I demonstrate how language adaptation works in three training scenarios: Translation Quality Estimation, Part-of-Speech tagging and Named Entity Recognition.
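
The lexical-transfer step can be illustrated with a minimal sketch. It is not the method used in the talk: it assumes that cognates can be approximated by normalized Levenshtein similarity over surface forms, and the function names, the 0.7 threshold, and the toy Russian (donor) and Ukrainian (recipient) word lists are all hypothetical.

```python
from typing import Dict, Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def transfer_lexicon(donor_lexicon: Dict[str, str],
                     recipient_words: Iterable[str],
                     threshold: float = 0.7) -> Dict[str, str]:
    """Copy a donor word's label to each recipient word that looks like a cognate.

    The threshold is an illustrative assumption, not a tuned value.
    """
    transferred = {}
    for word in recipient_words:
        # Pick the donor word whose surface form is closest to the recipient word.
        best = max(donor_lexicon, key=lambda d: similarity(word, d), default=None)
        if best is not None and similarity(word, best) >= threshold:
            transferred[word] = donor_lexicon[best]
    return transferred


# Toy data (hypothetical): Russian donor lexicon with POS tags,
# Ukrainian recipient vocabulary without any annotation.
donor = {"вода": "NOUN", "читать": "VERB", "новый": "ADJ"}
recipient = ["вода", "читати", "новий", "будинок"]
print(transfer_lexicon(donor, recipient))
```

Recipient words with no sufficiently close donor form (here "будинок") receive no label, which is where a learned feature representation would have to take over from plain lexical transfer.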


Similar resources

Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?

Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test...


Annotation Adaptation and Language Adaptation in NLP

Adaptation technologies are always useful in NLP when there is a discrepancy between the training scenario and the use scenario. They are also effective in alleviating the data scarcity problem. Domain adaptation is the most popular kind of adaptation technology and is intensively researched. In this talk we will introduce two other kinds of adaptation technologies: annotation adaptation and langua...


Lexicon Induction for Spoken Rusyn - Challenges and Results

This paper reports on challenges and results in developing NLP resources for spoken Rusyn. Being a Slavic minority language, Rusyn does not have any resources to make use of. We propose to build a morphosyntactic dictionary for Rusyn, combining existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish. We adapt these resources to Rusyn by using vow...


Automated Narrative Information Extraction Using Non-Linear Pipelines

Our research focuses on the problem of automatically acquiring structured narrative information from natural language. We have focused on character extraction and narrative role identification from a corpus of Slavic folktales. To address natural language processing (NLP) issues in this particular domain we have explored alternatives to linear pipelined architectures for information extraction,...


The Gun Violence Database: A new task and data set for NLP

We argue that NLP researchers are especially well-positioned to contribute to the national discussion about gun violence. Reasoning about the causes and outcomes of gun violence is typically dominated by politics and emotion, and data-driven research on the topic is stymied by a shortage of data and a lack of federal funding. However, data abounds in the form of unstructured text from news arti...




Publication date: 2017